
Avoid deep copy on lz4 decompression #7437

Merged 4 commits on Dec 29, 2022
Conversation

@crusaderky (Collaborator) commented Dec 28, 2022

Speed up deserialization when
a. lz4 is installed, and
b. the buffer is compressible, and
c. the buffer is smaller than 64 MiB (distributed.comm.shard)
Note that the default chunk size in dask.array is 128 MiB.

Note that this does not prevent a memory flare, as there's an unnecessary deep copy upstream as well:
https://github.com/python-lz4/python-lz4/blob/79370987909663d4e6ef743762768ebf970a2383/lz4/block/_block.c#L256
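The actual diff isn't reproduced in this conversation view. As a rough illustration of the idea only (hypothetical helper name, not the code in this PR), python-lz4 can be asked to decompress into a mutable bytearray via return_bytearray=True, so that downstream deserialization can build a writeable array without another copy:

import lz4.block
import numpy as np

def lz4_decompress_writeable(payload):
    # Hypothetical helper, not the actual distributed.protocol change:
    # return_bytearray=True makes python-lz4 return a mutable bytearray,
    # which can back a writeable numpy array without a further copy.
    return lz4.block.decompress(payload, return_bytearray=True)

x = np.arange(1_000_000, dtype="int64")
compressed = lz4.block.compress(x.data)
buf = lz4_decompress_writeable(compressed)
y = np.frombuffer(buf, dtype="int64")  # writeable view over the bytearray
assert (x == y).all() and y.flags.writeable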

@crusaderky self-assigned this Dec 28, 2022
github-actions bot (Contributor) commented Dec 28, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     22 files  ±  0       22 suites  ±  0   9h 57m 15s ⏱️ −23m 5s
  3 283 tests  +  5    3 196 ✔️ +18      85 💤 ±  0    2 ❌ −13
36 044 runs   +55   34 501 ✔️ +93   1 541 💤 +24     2 ❌ −62

For more details on these failures, see this check.

Results for commit 9b14c35. ± Comparison against base commit f3995b5.

♻️ This comment has been updated with latest results.

@mrocklin (Member) left a comment

In principle this seems fine. I did have a couple of small questions though.

  x = np.arange(1000000, dtype="int64")
  compression, payload = maybe_compress(x.data)
- assert compression == "lz4"
+ assert compression in {"lz4", "snappy", "zstd", "zlib"}
@mrocklin (Member)
I think I would be sad if we used zlib by default in any configuration. I'll bet that it's faster to just send data uncompressed over the network.

@crusaderky (Collaborator, Author)

Ah, you're right. I misread the code; default compression is lz4 -> snappy -> None.
I've amended the tests and added a specific test for the priority order.
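The amended test itself isn't shown here; a minimal sketch of a priority-order check (hypothetical test name, assuming maybe_compress is importable from distributed.protocol.compression) could look like:

import numpy as np

def test_compression_priority_order():
    # Hypothetical sketch, not the test added in this PR: whichever of
    # lz4/snappy is installed should win, in that order; with neither
    # installed the data stays uncompressed and compression is None.
    from distributed.protocol.compression import maybe_compress

    x = np.arange(1_000_000, dtype="int64")
    compression, payload = maybe_compress(x.data)
    try:
        import lz4  # noqa: F401
        assert compression == "lz4"
    except ImportError:
        try:
            import snappy  # noqa: F401
            assert compression == "snappy"
        except ImportError:
            assert compression is None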

@@ -217,7 +217,6 @@ def test_itemsize(dt, size):


 def test_compress_numpy():
-    pytest.importorskip("lz4")
@mrocklin (Member)

I'm curious, why this change? If we didn't have lz4, snappy, or zstandard installed (all of which are optional I think) then I'd expect this to fail.

The only compressor we have by default, I think, is zlib, and we don't compress with that by default.

@crusaderky (Collaborator, Author)

Actually, if you have snappy but not lz4 it will succeed.
zstandard does not install itself as a default compressor.
Amended the tests to reflect this.
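A short sketch of what that skip condition might look like (hypothetical helper, not the exact amendment in this PR):

import pytest

def _require_default_compressor():
    # Hypothetical helper: skip unless lz4 or snappy is importable.
    # zstandard alone is not enough, because it does not install itself
    # as a default compressor.
    try:
        import lz4  # noqa: F401
    except ImportError:
        pytest.importorskip("snappy")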

Successfully merging this pull request may close these issues.

Deserialization of compressed data is sluggish and causes memory flares